Example 5: Allow all access requests, but deny requests from specific IPs or IP ranges (block malicious IPs or rogue-crawler address ranges). Configuration under Apache 2.4 is sketched after Example 6 below.
Example 6: Allow all access requests, but deny requests that carry certain user-agents (block spam crawlers by user-agent). Use mod_setenvif to match the user-agent of an incoming request against a regular expression, set the internal environment variable BadBot, and finally deny those requests.
Apache:
① Modify the .htaccess file: in the site directory, edit .htaccess and add the following code:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|JikeSpider|Indy) [NC]
RewriteRule ^(.*)$ - [F]
② Modify the httpd.conf configuration file: find the corresponding location, add/modify it according to the following code, and then restart Apache:
DocumentRoot /home/wwwroot/xxx
SetEnvIfNoCase User-Agent ".*(FeedDemon|JikeSpider|Indy)" BadBot
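A fuller sketch of the httpd.conf approach, assuming Apache 2.2-style Order/Allow/Deny directives and keeping the /home/wwwroot/xxx document root from the fragment above; the Deny from env=BadBot line is what finally rejects the matched requests:

DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    # tag requests whose user-agent matches the spam-crawler pattern
    SetEnvIfNoCase User-Agent ".*(FeedDemon|JikeSpider|Indy)" BadBot
    Order Allow,Deny
    Allow from all
    # reject every request that carries the BadBot variable
    Deny from env=BadBot
</Directory>

For Example 5 (blocking specific IPs or IP ranges under Apache 2.4), a minimal sketch using Apache 2.4's Require directives; the addresses 203.0.113.25 and 198.51.100.0/24 are hypothetical placeholders for the malicious IPs you identify:

<Directory "/home/wwwroot/xxx">
    <RequireAll>
        # allow everyone except the listed IPs and ranges
        Require all granted
        Require not ip 203.0.113.25
        Require not ip 198.51.100.0/24
    </RequireAll>
</Directory>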
" effective, to prevent "villain" to use the 3rd strokes ("Gentleman" and "villain" respectively refers to abide by and do not comply with the robots.txt agreement spider/robots), so the site after the online to keep track of the analysis of the log, screening out these Badbot IP, and then block it.Here's a Badbot IP database: http://www.spam-whackers.com/bad.bots.htm4, through the search engine provides we
Example 2. Allow all robots access
(or you can create an empty "/robots.txt" file)
User-agent: *
Disallow:
Example 3. Deny access to a specific search engine
User-agent: BadBot
Disallow: /
Example 4. Allow access to only one search engine (Baiduspider in this example)
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
Example 5. A simple example: in this example, three directories of the site www.seovip.cn restrict access by search engines (a sketch of such a file follows below).
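The source does not show the file itself; a plausible robots.txt for Example 5, assuming the three restricted directories are /cgi-bin/, /tmp/, and /private/ (hypothetical names, borrowed from the other examples on this page):

User-agent: *
Disallow: /cgi-bin/   # keep robots out of CGI scripts
Disallow: /tmp/       # temporary files
Disallow: /private/   # private content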
Syntax analysis: text after "#" is descriptive comment information; "User-agent:" gives the name of the search robot the rules apply to, and "*" means all search robots; "Disallow:" is followed by the file or directory that must not be accessed.
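Putting those three elements together, a minimal annotated sketch (the /secret/ path is hypothetical):

# this line is a comment and is ignored by robots
User-agent: *      # "*" applies the rules to every search robot
Disallow: /secret/ # robots must not fetch anything under /secret/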
Next, here are the specific uses of robots.txt:
Allow access by all robots:
User-agent: *
Disallow:
(Alternatively, you can create an empty "/robots.txt" file.)
Prohibit all search engines from accessing any part of the website:
User-agent: *
Disallow: /
Note that at least one "Disallow" record is required in the robots.txt file. If robots.txt is an empty file, the website is open to all search engine robots.
Prohibit all search engines from accessing certain parts of the website (the cgi-bin, tmp, and private directories in the following example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Example: the robots.txt file from http://www.shijiazhuangseo.com.cn:
# All robots will spider the domain
User-agent: *
Disallow:
The above means that all search robots are allowed to access every file under the site www.shijiazhuangseo.com.cn.
When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory; if it does, the search robot determines its access scope based on the content of the file, and if the file does not exist, the search robot simply crawls along the links. In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase. Writing robots.txt is very simple; since there is plenty of information about it on the Internet, I will not repeat it here and only give a few common examples.
(1) Prohibit all search engines from accessing any part of the website:
User-agent: *
Disallow: /
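As noted above, robots only look for the file at the site root; a minimal sketch, where www.example.com is a hypothetical site:

# robots fetch this file only from the site root:
#   http://www.example.com/robots.txt        <- fetched and obeyed
#   http://www.example.com/blog/robots.txt   <- ignored by robots
User-agent: *
Disallow: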
However, some rogue crawlers ignore robots.txt and should be blocked from the website. The following describes how to block them:
# get rid of the bad bot
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BadBot
RewriteRule ^(.*)$ http://go.away/
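This rule redirects the bot to a dead address. A variant, assuming you prefer to answer with "403 Forbidden" instead of a redirect (the same [F] flag used in the .htaccess example at the top of this page):

# return 403 Forbidden to the bad bot instead of redirecting it
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BadBot
RewriteRule ^(.*)$ - [F]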
The preceding rule blocks a single crawler. To block multiple crawlers at once, configure .htaccess as follows:
# get rid of bad bots
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BadBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [OR]
RewriteCond %{HTTP_USER_AGENT}
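The snippet breaks off after the third RewriteCond. A complete sketch of the multi-bot block, where ^FakeUser stands in as a hypothetical third bot pattern and the final RewriteRule mirrors the single-bot example above:

# get rid of bad bots: any one matching condition triggers the rule
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BadBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^FakeUser
RewriteRule ^(.*)$ http://go.away/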
3. Names of common search engine robots
Name         Search engine
Baiduspider  http://www.baidu.com
Scooter      http://www.altavista.com
ia_archiver  http://www.alexa.com
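These names are what goes after "User-agent:" when writing per-robot rules; a short sketch using two of them (the /tmp/ path is a hypothetical example):

# lighter rules for Baiduspider, a full ban for Scooter
User-agent: Baiduspider
Disallow: /tmp/

User-agent: Scooter
Disallow: /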
" file.
Prohibit all search engines from accessing any part of the website
User-Agent :*Disallow :/
Prohibit all search engines from accessing the website (in the following example, the 01, 02, and 03 Directories)
User-Agent :*Disallow:/01/Disallow:/02/Disallow:/03/
Prohibit Access to a search engine (badbot in the following example)
User-Agent: badbotDisallow :/
Only access to a search engine is allowed (The crawler in the following example)
User-Age
any part of the website:User-Agent :*Disallow :/
L allow access by all robotsUser-Agent :*Disallow:Alternatively, you can create an empty file "/robots.txt" File
L prohibit all search engines from accessing the website (cgi-bin, TMP, and private directories in the following example)User-Agent :*Disallow:/cgi-bin/Disallow:/tmp/Disallow:/private/
L prohibit access to a search engine (badbot in the following example)User-Agent: badbotDisallow :/
L only
In the root directory of the website, you can also create the robots.txt file to guide search engines in indexing the site. Google's spider is Googlebot, Baidu's spider is Baiduspider, and MSN's spider is MSNBot. robots.txt writing syntax:
• Prohibit all search engines from accessing any part of the site:
User-agent: *
Disallow: /
• Allow all robots access:
User-agent: *
Disallow:
Or you can create an empty "/robots.txt" file.
• Prohibit all search engines from accessing several parts of the site (the cgi-bin, tmp, and private directories in the following example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
• Prohibit access by a specific search engine (BadBot in the following example):
User-agent: BadBot
Disallow: /